How many GPUs can Ollama use simultaneously?
Okay, let's break down how many GPUs Ollama can use simultaneously. The answer is nuanced and depends on several factors, but here's a comprehensive overview:
1. Short Answer: All Visible GPUs on a Single Machine
- Ollama can split a model across multiple GPUs, but only within one machine: its llama.cpp backend shards model layers across every GPU the server can see. There is no documented hard cap such as "8 GPUs"; the practical limit is set by your hardware and how well the model splits, and Ollama does not do multi-node distributed inference out of the box.
2. Practical Considerations & Performance
- Model Size: This is the biggest factor. Larger models (like 70B parameter models) require significantly more GPU memory and therefore benefit far more from multiple GPUs. Smaller models (e.g., 7B, especially at 4-bit quantization) usually fit comfortably on a single GPU and gain little from more.
- GPU Memory (VRAM): The model (plus KV cache and overhead) needs to fit in total VRAM. If it's too large for a single GPU, Ollama will automatically split it across multiple GPUs; any layers that still don't fit are offloaded to CPU RAM, which is much slower.
- Interconnect Speed: The speed of the connection between your GPUs (e.g., NVLink, PCIe) drastically impacts performance. NVLink provides the fastest speeds, while PCIe will be slower.
- Batch Size: The number of prompts processed simultaneously (batch size) affects GPU utilization. A larger batch size can improve throughput but requires more VRAM.
- Ollama's Multi-GPU Support: Ollama relies on its llama.cpp backend for multi-GPU inference. It shards the model layer-wise across GPUs and manages the transfers between them; because each token still passes through the layers in sequence, adding GPUs mainly buys you capacity (larger models) rather than proportionally faster single-stream generation.
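To make the VRAM discussion above concrete, here is a back-of-envelope estimator for how many GPUs a dense model might need. The bytes-per-parameter figures and the 20% overhead factor are rough illustrative assumptions (covering KV cache and framework buffers), not Ollama internals:

```python
import math

# Rough bytes per parameter for common GGUF quantization levels (assumption).
BYTES_PER_PARAM = {
    "f16": 2.0,   # unquantized half precision
    "q8_0": 1.0,  # ~8-bit quantization
    "q4_0": 0.5,  # ~4-bit quantization (a common Ollama default)
}

def estimate_gpus(params_billions: float, quant: str,
                  vram_per_gpu_gb: float, overhead: float = 1.2) -> int:
    """Return a rough minimum GPU count for a dense model.

    `overhead` pads for KV cache, activations, and runtime buffers.
    """
    needed_gb = params_billions * BYTES_PER_PARAM[quant] * overhead
    return max(1, math.ceil(needed_gb / vram_per_gpu_gb))

if __name__ == "__main__":
    # A 7B model at 4-bit fits easily on one 24 GB card...
    print(estimate_gpus(7, "q4_0", 24))   # -> 1
    # ...while a 4-bit 70B model needs roughly two of them.
    print(estimate_gpus(70, "q4_0", 24))  # -> 2
```

Treat the result as a lower bound: longer context windows grow the KV cache well beyond the flat overhead assumed here.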
3. Recommended Configurations
Here's a general guideline, recognizing that this is approximate and depends on your specific use case:
- 1 GPU: Sufficient for 7B-13B models at 4-bit quantization (e.g., Llama 2 7B, Mistral 7B) on a card with 8-16 GB of VRAM.
- 2-4 GPUs: For larger models such as Llama 2 34B or a 4-bit 70B, or for running smaller models at higher precision and longer context.
- 4-8 GPUs: For 70B-class models at higher precision, very long contexts, or heavier concurrent load.
- Beyond that: For extremely large models or massive throughput demands you're outside what Ollama targets on its own, and a dedicated multi-node serving stack becomes the better fit.
4. How to Configure in Ollama
Ollama has no per-run GPU flags (there is no --gpu-memory or --gpu-count option); multi-GPU behavior is controlled through environment variables and model options:
- CUDA_VISIBLE_DEVICES: The standard NVIDIA environment variable that limits which GPUs the Ollama server can see, e.g. CUDA_VISIBLE_DEVICES=0,1.
- OLLAMA_SCHED_SPREAD: Set to 1 on the server to spread a model across all available GPUs even when it would fit on one.
- num_gpu: A model option (set in a Modelfile or per API request) that controls how many layers are offloaded to the GPU(s), not how many GPUs are used.
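Putting those knobs together, a typical single-node multi-GPU launch looks roughly like the sketch below. It assumes NVIDIA GPUs, an installed Ollama, and uses GPU indices and the llama2:70b model purely as examples:

```shell
# Restrict the server to GPUs 0-3 (standard CUDA environment variable):
export CUDA_VISIBLE_DEVICES=0,1,2,3

# Ask Ollama to spread the model across all visible GPUs,
# even if it would fit on a single one:
export OLLAMA_SCHED_SPREAD=1
ollama serve &

# Then run a model as usual; Ollama decides the layer split itself:
ollama run llama2:70b
```

Note that both environment variables must be set where the server process starts; setting them only in the shell that runs `ollama run` has no effect on an already-running server.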
5. Key Takeaway:
Ollama is designed to handle multiple GPUs. However, it’s crucial to consider your model size, GPU memory, interconnect speed, and desired performance when setting up your hardware. Start with a smaller number of GPUs and increase as needed to maximize throughput while staying within your GPU memory constraints.
To give you more tailored advice, could you tell me:
- What model are you planning to run? (e.g., Llama 2 7B, Mistral 7B, etc.)
- What are your GPU specs (brand and model)? (e.g., NVIDIA RTX 3090, RTX 4090, etc.)